ABSTRACT 

In a provocative and influential paper, Jesse Rothstein (2010) finds that standard value-added 
models (VAMs) suggest implausible future teacher effects on past student achievement, a finding 
that obviously cannot be viewed as causal. This is the basis of a falsification test (the Rothstein 
falsification test) that appears to indicate bias in VAM estimates of current teacher contributions to 
student learning. 

Rothstein’s finding is significant because there is considerable interest in using VAM teacher 
effect estimates for high-stakes teacher personnel policies, and the results of the Rothstein test cast 
considerable doubt on the notion that VAMs can be used fairly for this purpose. However, in this 
paper, we illustrate — theoretically and through simulations — plausible conditions under which the 
Rothstein falsification test rejects VAMs even when there is no bias in estimated teacher effects, and 
even when students are randomly assigned conditional on the covariates in the model. On the 
whole, our findings show that the “Rothstein falsification test” is not definitive in showing bias, 
which suggests a much more encouraging picture for those wishing to use VAM teacher effect 
estimates for policy purposes. 
I. INTRODUCTION 

In a provocative and influential paper, “Teacher Quality in Educational Production: Tracking, 
Decay, and Student Achievement,” Jesse Rothstein (2010) reports that VAMs used to estimate the 
contribution individual teachers make toward student achievement fail falsification tests, which 
appears to suggest that VAM estimates are biased. More precisely, Rothstein shows that teachers 
assigned to students in the future are statistically significant predictors of past 
student achievement, a finding that obviously cannot be viewed as causal. Rather, the finding 
appears to signal that student-teacher sorting patterns in schools are not fully accounted for by the 
types of variables typically included in VAMs, implying a correlation between omitted variables 
affecting student achievement and teacher assignments. Rothstein presents this finding (his 
falsification test) as evidence that students are not effectively randomly assigned to teachers 
conditional on the covariates in the model. 

Rothstein’s falsification test has become a key method for academic papers to test the validity 
of VAM specifications. Koedel and Betts (2009), for instance, argue that there is little evidence of 
bias based on the Rothstein test when the VAM teacher effect estimates are based on teachers 
observed in multiple classrooms over time. 
Harris and Sass (2010) report selecting school districts in which the Rothstein test does not falsify 
when estimating the impacts of teacher training programs. Briggs and Domingue (2011) say that 
they use the Rothstein test to critique value-added results that were produced for the Los Angeles 
public schools and later publicly released. 

Rothstein’s finding has considerable relevance because there is great interest in using VAM 
teacher effect estimates for policy purposes such as pay for performance (Podgursky and Springer 
2007; Eckert and Dabrowski 2010) or determining which teachers maintain their eligibility to teach 
after some specified period of time, such as when tenure is granted (Goldhaber and Hansen 2010a; 
Gordon et al. 2006; Hanushek 2009). Many would argue that, if VAMs are shown to produce biased 
teacher effect estimates, it casts doubt upon the notion that they can be used for such high-stakes 
policy purposes. Indeed, this is how the Rothstein findings have been interpreted. For instance, in 
an article published in Education Week, Debra Viadero (2009) interprets the Rothstein paper to 
suggest that “‘Value-added’ methods for determining the effectiveness of classroom teachers are 
built on some shaky assumptions and may be misleading.” 1 2 As Rothstein (2010) himself said, the 
“results indicate that policies based on these VAMs will reward or punish teachers who do not 
deserve it and fail to reward or punish teachers who do.” In a recent congressional briefing, 


1 Other researchers have come to somewhat different conclusions about whether VAMs are likely to produce 
biased estimates of teacher effectiveness. Kane and Staiger (2008), for instance, find that certain VAM specifications 
produce teacher effect estimates that are similar to those produced under experimental conditions, a result supported by 
nonexperimental findings of Chetty et al. (2011), who exploit differences in the estimated effectiveness of teachers who 
transition between grades or schools to test for bias. And while Koedel and Betts (2009) confirm Rothstein’s basic 
findings about single-year teacher effect estimates, they report finding no evidence of bias based on the Rothstein 
falsification test when the VAM teacher effect estimates are based on teachers observed in multiple classrooms over 
time. 


2 Others have cited his work in similar ways (Hanushek and Rivkin 2010; Rothman 2010; Baker et al. 2010; and 
Kraemer et al. 2011). 
Rothstein cited his falsification test results and said that value-added is “not fair to ... special needs 
teachers ... [or] other specialists.” 3 

In this paper, we show that the Rothstein falsification test can reject when there is no bias. 
Specifically, we describe plausible conditions under which the Rothstein test rejects the null 
hypothesis of no impacts of future teachers on lagged achievement even when students are 
randomly assigned conditional on the covariates in the model and there is no bias in estimated 
teacher effects. We verify these conditions theoretically and through a series of simulations. Our 
findings are important because, as noted above, the Rothstein test is shaping not only academic 
studies but also public perception about the efficacy of utilizing VAMs. 


3 Haertel et al. (2011). 
II. THE ROTHSTEIN FALSIFICATION TEST AND BIAS 
A. The Value-Added Model Formulation 

There is a growing body of literature that examines the implications of using VAMs in an 
attempt to identify causal impacts of schooling inputs and contributions of individual teachers 
toward student learning (for example, Chetty et al. 2011; Ballou et al. 2004; Goldhaber and Hansen 
2010a; Kane and Staiger, 2008; McCaffrey et al. 2004, 2009; Rothstein, 2009, 2010; Rockoff, 2004). 
Researchers typically assume their data can be modeled by some variant of the following equation: 

(1) $A_{ig} = \alpha + \lambda A_{i(g-1)} + \sum_t \beta_t \tau_{tig} + e_{ig}$ 

where $A_{ig}$ is the achievement of student i in grade g, 

$\alpha$ is the intercept and also the value added of the omitted teacher, 4 

$\lambda$ is the impact of lagged achievement, 

$\tau_{tig}$ is a dummy variable identifying if student i had teacher t in grade g, 

$\beta_t$ is the impact of teacher t compared to the omitted teacher, 5 and 

$e_{ig}$ is an error term that represents other factors that affect student learning. 6 7 

If $e_{ig}$ is uncorrelated with the other variables in equation 1, then the impact of teacher t can be 
estimated by regressing student achievement ($A_{ig}$) on prior student achievement ($A_{i(g-1)}$) and dummy 
variables identifying the teacher student i had in grade g ($\tau_{tig}$). Of course, this model makes a 
number of assumptions about the nature of student learning; see, for instance, Harris et al. (2010), 
Rothstein (2010), or Todd and Wolpin (2003) for more background on these assumptions. 8 
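
To make the estimation step concrete, the following is a minimal sketch of equation 1 estimated by ordinary least squares on simulated data. It is an illustration only, not the estimation code used in this paper: the function name `estimate_vam`, the four teacher effects, the sample size, and the error standard deviation are all assumptions chosen for the example, and students are assigned to teachers at random.

```python
import numpy as np

def estimate_vam(n_students=4000, lam=0.9, seed=0):
    """Simulate equation 1 under random assignment and estimate it by OLS.

    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    beta = np.array([0.0, 0.1, -0.1, 0.2])            # teacher 0 is the omitted teacher
    lagged = rng.normal(0.0, 1.0, n_students)          # A_{i(g-1)}
    teacher = rng.integers(0, len(beta), n_students)   # random assignment to teachers
    e = rng.normal(0.0, 0.4, n_students)               # e_{ig}, independent of covariates
    current = lam * lagged + beta[teacher] + e         # equation 1 with alpha = 0

    # Design matrix: intercept, lagged achievement, dummies for teachers 1..3.
    dummies = (teacher[:, None] == np.arange(1, len(beta))).astype(float)
    X = np.column_stack([np.ones(n_students), lagged, dummies])
    coef, *_ = np.linalg.lstsq(X, current, rcond=None)
    return beta[1:], coef[1], coef[2:]

true_beta, lam_hat, beta_hat = estimate_vam()
print("true teacher effects:     ", true_beta)
print("estimated teacher effects:", np.round(beta_hat, 3))
print("estimated lambda:         ", round(float(lam_hat), 3))
```

Because the simulated error is independent of lagged achievement and of teacher assignment, the estimated teacher effects should be close to the true values, up to sampling noise.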


4 The intercept equals the value of the outcome when all other variables are set to 0. If the achievement scores 
(current and baseline) are mean centered, the intercept equals the outcome for a student with an average baseline score 
who has the omitted teacher. 

5 In the rest of this paper, we generally refer to $\beta_t$ as the impact of teacher t. This can be thought of as equivalent 
to saying that the results are normalized so that the impact of the omitted teacher is 0. 

6 This model is similar to the VAM2 model discussed by Rothstein (2010), which also allows the coefficient on 
lagged achievement to differ from 1 and excludes student-fixed effects. Value-added models often also include school 
and grade fixed effects and a vector of student and family background characteristics (for example, age, disability, 
English language status, free or reduced-price lunch status, race, ethnicity, whether a student has previously been 
retained in grade, and parental education). These factors could be incorporated into our models with no changes in the 
substantive findings. Rothstein (2010) also discusses student-fixed effects and measurement error, two issues that we 
address below. 

7 Like much of the value-added literature, Rothstein does not try to separate classroom and teacher effects. 

8 Note that, even if VAMs produce effect estimates that are unbiased, they may not be very reliable. For more on 
this, see Goldhaber and Hansen (2010b), McCaffrey et al. (2009), and Schochet and Chiang (2010). 
Rothstein questions whether standard VAMs produce unbiased estimates of teacher effect on 
student learning. 4 In particular, he describes a number of ways in which the processes governing the 
assignment of students to teachers may lead to erroneous conclusions about teacher effectiveness. 
Of particular concern is the possibility that students may be tracked into particular classrooms based 
on skills that are not accounted for by $A_{i(g-1)}$. 9 10 

B. Bias in VAM Teacher Effect Estimates 

If equation 1 is estimated using ordinary least squares, the estimated impacts of a teacher can be 
biased for a number of reasons. To derive a formula for this bias, we divide the error term ($e_{ig}$) into 
two components: $ov_{ig}$, which, if it exists, we assume to be correlated with at least some of the 
covariates included in the model even after controlling for the others, and $eu_{ig}$, which is uncorrelated 
with any of the covariates in the model. 

Thus, 

$e_{ig} = \gamma \, ov_{ig} + eu_{ig}$ 

where $\gamma$ is the coefficient on $ov_{ig}$. $\gamma$ is also the coefficient that would be obtained on $ov_{ig}$, were it 
added to equation 1. 

It should be noted that, by definition, if $ov_{ig}$ exists it would cause bias for at least some 
coefficient estimates. However, as we describe below, it can exist and not necessarily cause bias for 
the estimated teacher effects. 

The general formula for omitted variable bias for the effect of teacher t takes the form: 


$\mathrm{Bias}(\hat{\beta}_t) = E(\hat{\beta}_t) - \beta_t = \gamma \pi_{\tau e}$ 


where 


$\hat{\beta}_t$ = estimate of $\beta_t$ and 


$\pi_{\tau e}$ = coefficient on $\tau_{tig}$ from a regression of $ov_{ig}$ on all right-hand side variables in equation 1 
(except $e_{ig}$). 


It can be shown that, 


9 Parameter estimates can often be consistently estimated with large sample sizes but remain biased when sample 
sizes are small relative to the number of parameters being estimated. In this paper, we focus primarily on situations with 
large sample sizes. Kinsler (2011) investigates what happens to the Rothstein test when smaller sample sizes are used. 

10 This could happen for a number of reasons. For example, principals might assign students to teachers based in 
part on unobserved factors that impact current achievement but are not captured by lagged achievement. Principals 
might do this either because they believe that certain teachers have a comparative advantage in educating certain 
students or because the principals are rewarding certain teachers with choice classroom assignments. Similar results 
could hold if parents lobby to have their children assigned to particular teachers. 
(2) 11 $\pi_{\tau e} = \mathrm{cov}(ov^*_{ig}, \tau^*_{tig}) / \mathrm{var}(\tau^*_{tig})$ 

where $ov^*_{ig}$ and $\tau^*_{tig}$ are the residual values of $ov_{ig}$ and $\tau_{tig}$ that remain after controlling for other 
variables in a linear regression model. 12 

This formulation is helpful because it shows that only the residual value of $ov_{ig}$ matters for bias 
in the estimated teacher effects. In other words, $ov_{ig}$ would not cause bias in the estimated teacher 
effects if its residual is uncorrelated with the residual of $\tau_{tig}$. 13 

If one estimates a linear VAM, then the residual of $ov_{ig}$ would also be based on a linear 
regression. Consequently, another way to state the condition of obtaining unbiased teacher effects in 
the presence of an omitted variable in a linear VAM is as follows: if the omitted variable is a linear 
function of lagged achievement and the other control variables in the equation, then it need not 
cause bias in the estimated teacher effects because the partial correlation is picked up by the included 
variables. 
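
The bias formula and equation 2 can be checked numerically. The sketch below is an illustration under assumed parameter values (the function names, the coefficients 0.3 and 0.4 in the omitted-variable equation, and the sample size are all hypothetical). It builds an omitted variable that remains correlated with the teacher dummy after controlling for lagged achievement, then confirms two in-sample identities: the gap between the teacher coefficient without and with the omitted variable equals the estimated gamma times the estimated pi, and pi can be obtained from the residualized regression in equation 2.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ovb_decomposition(n=5000, gamma=0.5, seed=1):
    rng = np.random.default_rng(seed)
    lagged = rng.normal(size=n)
    # Teacher 1 is more likely for students with high lagged achievement.
    tau = (lagged + rng.normal(size=n) > 0).astype(float)
    # Omitted variable: correlated with tau even after conditioning on lagged.
    ov = 0.3 * lagged + 0.4 * tau + rng.normal(size=n)
    e = gamma * ov + rng.normal(scale=0.4, size=n)
    y = 0.9 * lagged + 0.1 * tau + e

    ones = np.ones(n)
    X_short = np.column_stack([ones, lagged, tau])       # VAM without ov
    X_long = np.column_stack([ones, lagged, tau, ov])    # VAM with ov added
    b_short, b_long = ols(X_short, y), ols(X_long, y)
    pi_hat = ols(X_short, ov)[2]                         # coefficient on tau

    # Equation 2: residualize ov and tau on the remaining covariates.
    Z = np.column_stack([ones, lagged])
    ov_star = ov - Z @ ols(Z, ov)
    tau_star = tau - Z @ ols(Z, tau)
    pi_fwl = (ov_star @ tau_star) / (tau_star @ tau_star)
    return b_short[2], b_long[2], b_long[3], pi_hat, pi_fwl

b_short, b_long, gamma_hat, pi_hat, pi_fwl = ovb_decomposition()
print("bias in teacher coefficient:", round(float(b_short - b_long), 4))
print("gamma_hat * pi_hat:         ", round(float(gamma_hat * pi_hat), 4))
```

If the 0.4 coefficient on the teacher dummy in the omitted-variable equation were set to zero, so that the omitted variable is a linear function of lagged achievement plus noise, the estimated pi would be close to zero and the teacher coefficient would be approximately unbiased.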

If the omitted variable is a nonlinear function of lagged achievement, then it is more likely to 
cause bias. However, even in this case, it may still be possible to obtain unbiased estimates of the 
teacher effects if the omitted variable is time-invariant (that is, $ov_{ig} = ov_i$). In that case, 
tracking based on time-invariant factors can be addressed through the inclusion of a 
student fixed effect, which captures time-invariant student or family background attributes that 
affect current achievement in ways not captured by previous achievement. 14 But one of the 
significant contributions of Rothstein’s work is that he raises an additional concern about VAMs: he 
postulates that student sorting into classrooms is “dynamic” in the sense that an omitted factor that 
affects the error term may be time-varying and correlated with placement into classrooms (that is, a 
form of tracking). This could lead to bias. He specifically suggests that the omitted factors cause the 
errors to be negatively correlated over time. 15 This could be due to compensating behavior whereby 
students who have a good year (along unobserved lines) receive fewer inputs in the following year 


11 Equation 2 holds because each coefficient estimate from any linear regression can be obtained by regressing the 
residualized outcome on the residualized version of the corresponding right-hand side variable in that equation. Each 
residualized variable is equal to the residual obtained after regressing the original variable on all of the other right-hand 
side variables in the equation (Goldberger 1991). 

12 We use the abbreviation “cov” for “covariance” and “var” for “variance” in equations throughout the paper. We 
use “*” to indicate a residual from a linear regression of the variable in question on lagged achievement. We could also 
write $ov_{ig} = \pi_{xe} x_{itg} + ov^*_{ig}$, where $x_{itg}$ represents all the covariates in the model except $\tau_{tig}$ and $\pi_{xe}$ is a vector of 
coefficients on those variables. In our example, the vector $x_{itg}$ includes lagged achievement and all other teacher variables 
in equation 1. 

13 The variable $ov_{ig}$ could still cause bias in the estimated impacts of other covariates, like the lagged achievement 
variable, and therefore be considered an omitted variable for the regression. 

14 Studies differ on the efficacy of including student fixed effects — Harris and Sass (2010) argue for it, while Kane 
and Staiger (2008) argue against. Koedel and Betts (2009) find that student fixed effects are statistically insignificant in 
the models they estimate. 

15 Rothstein’s evidence of a negative correlation of errors over time is derived from a VAM without student fixed 
effects. If important time-invariant omitted student factors exist, implying the need for student fixed effects, we would 
expect to see a positive correlation across grade levels between the errors in a model estimated without student fixed 
effects. Given this, it seems unlikely that student fixed effects explain Rothstein’s finding of bias. 
than students who are otherwise similar. 16 This dynamic form of tracking cannot be accounted for 
by the inclusion of a simple student fixed effect. 

C. Rothstein’s Falsification Test 

Rothstein’s falsification test relies on the notion that evidence of the statistical significance of 
future teachers in predicting past achievement suggests a misspecification of the VAM (Rothstein 
2010). Similar falsification tests are often used in the economics literature (Heckman and Hotz 1989; 
Ashenfelter 1978). However, as discussed below, these tests are generally used to check the 
specification of models that differ in important ways from VAM. 

We focus our paper on the version of the Rothstein test described in footnote 13 of his paper, 
where Rothstein states that “...random assignment conditional on $A_{g-1}$ will be rejected if the grade-g 
classroom teachers predict $A_{g-2}$ conditional on $A_{g-1}$” (this is described more formally in the next 
subsection). 17 Rothstein implements this test using a linear regression. For example, achievement in 
grade 3 can be regressed on grade 4 achievement and grade 5 teachers. The test is whether the 
coefficients on the grade 5 teachers are jointly significant. 
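
The regression form of the test can be sketched as follows. This is a hedged illustration rather than Rothstein's code: the helper names `ssr`, `rothstein_f`, and `demo`, the four-teacher setup, and the simulated data are assumptions. The joint test is computed as a standard F statistic comparing a restricted regression (grade 3 achievement on grade 4 achievement) with an unrestricted one that adds grade 5 teacher dummies.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def rothstein_f(a3, a4, t5):
    """F statistic for joint significance of grade 5 teacher dummies in a
    regression of grade 3 achievement on grade 4 achievement."""
    n = len(a3)
    teachers = np.unique(t5)
    dummies = (t5[:, None] == teachers[1:]).astype(float)   # first teacher omitted
    X_r = np.column_stack([np.ones(n), a4])                 # restricted model
    X_u = np.column_stack([X_r, dummies])                   # adds future teachers
    q = dummies.shape[1]
    df_resid = n - X_u.shape[1]
    ssr_r, ssr_u = ssr(X_r, a3), ssr(X_u, a3)
    f_stat = ((ssr_r - ssr_u) / q) / (ssr_u / df_resid)
    return f_stat, q, df_resid

def demo(n=2000, seed=0):
    rng = np.random.default_rng(seed)
    a3 = rng.normal(size=n)
    a4 = 0.9 * a3 + rng.normal(scale=0.4, size=n)
    t5_random = rng.integers(0, 4, n)                        # no tracking
    # Tracking on grade 3 achievement quartiles (illustrative assumption).
    t5_tracked = np.searchsorted(np.quantile(a3, [0.25, 0.5, 0.75]), a3)
    return rothstein_f(a3, a4, t5_random)[0], rothstein_f(a3, a4, t5_tracked)[0]

f_random, f_tracked = demo()
print("F with random grade 5 assignment:", round(f_random, 2))
print("F with tracking on grade 3 score:", round(f_tracked, 2))
```

The statistic is compared with an F distribution with `q` and `df_resid` degrees of freedom. In this sketch the tracked assignment produces a very large statistic, while random assignment should produce a value near 1.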

Rejecting the null of no future teacher effects might then be taken as evidence of bias. In other 
words, many people may assume it to imply a correlation between the independent variation in 
teacher assignments and the error in the achievement equation. But a more precise formulation of 
the necessary exclusion restriction for unbiased teacher effects is that current grade teacher 
assignments are orthogonal to all residual determinants of the current score that are not accounted for 
by the exogenous variables included as controls in the model. This does not, however, mean that 
future grade teacher assignments need necessarily be orthogonal to the residual determinants of past 
scores. Rothstein’s falsification test does provide evidence of tracking and that tracking could cause 
VAM misspecification. But, as we show below, the Rothstein test will reject, even in the absence of 
any omitted variables that might lead to biased teacher effect estimates. 

D. Comparing the Rothstein Falsification Test to a Formal Condition for 
Bias 

The central connection between the Rothstein falsification test and a formal condition for bias 
is through the tracking of students. However, tracking of students to future teachers (implied by the 
Rothstein test) need not imply that students are tracked to current teachers. For example, students 
can be tracked into future teacher classrooms based, at least in part, on past achievement, even if 
they were randomly assigned to current teachers. For this reason, the Rothstein test only applies to 
bias for current teacher effect estimates if tracking systems are consistent across grades. Consistent 
tracking across grades also means that we can lag the Rothstein test one period and get the same 


16 This could happen if family contributions to achievement are negatively correlated over time, conditional on past 
achievement. Alternatively, this could happen if the impacts of the error term decay at a different rate than other factors. 
By allowing the coefficient on lagged achievement to be less than one, the VAM we consider allows for decay, but 
implicitly assumes that the decay is the same for all factors (prior achievement, teachers, and the error term). 

17 It is important to note that the outcome in this equation is grade g-2 achievement. If one were to instead look at 
the impact of grade g teachers on grade g-1 achievement and control for grade g-2 achievement, then the nature of the 
test changes, although many of the same general properties hold. In particular, the test will often reject when there is no 
bias for estimated teacher effects. Results based on this test are available upon request. 
results. We do this because it provides a means of concisely comparing the Rothstein falsification 
and bias tests. 

For simplicity, in comparing the Rothstein and bias tests in this section of our paper we utilize a 
simple VAM with only one school and two teachers in each grade (in Section III we show 
simulations with more complex data structures). The dummy for one teacher is omitted from the 
regression and the current grade teacher assignment depends entirely on lagged achievement. Thus: 

(3) 18 $A_{ig} = \lambda A_{i(g-1)} + \beta_{1g} \tau_{1ig} + e_{ig}$ 

where $\mathrm{cov}(e_{ig}, A_{i(g-1)}) = \mathrm{cov}(e_{ig}, \tau_{1ig}) = 0$. 

We specify a flexible functional form for tracking for teacher 1 (the one with higher value 
added): 

$\tau_{1ig} = T(A_{i(g-1)})$ 19 

Rothstein’s falsification test is based on a regression of lagged achievement on current 
achievement and future teachers. 20 As noted earlier, to simplify our discussion we lag the Rothstein 
test one period so that instead of focusing on the relationship between future teachers and lagged 
achievement, the test is based on the relationship between current teachers and double-lagged 
achievement. This enables us to compare the conditions for bias and the Rothstein test using the 
same set of teachers. If the possible sources of bias are the same across grades, this simplification 
has no effect on our results. 

The Rothstein test using future teachers can be written as: 

$A_{i(g-1)} = \lambda_1 A_{ig} + R_{1(g+1)} \tau_{1i(g+1)} + w_{i(g-1)}$ 

where $R_{1(g+1)}$ describes the regression-adjusted relationship between future teachers and lagged 
achievement. 

When we lag this one period, we get 

(4) $A_{i(g-2)} = \lambda_1 A_{i(g-1)} + R_{1g} \tau_{1ig} + w_{i(g-2)}$ 

The Rothstein test involves estimating whether or not $R_{1g}$ differs from 0. 


18 To simplify our presentation, we omit intercepts and measurement error from these equations. Adding them 
back does not change our substantive findings. Numerous methods exist to obtain unbiased estimates in the presence of 
such measurement error (Potamites et al. 2009; Meyer 1999; Fuller 1987). 

19 $T()$ is bounded between 0 and 1 and $dT()/dA_{i(g-1)} > 0$. 

20 In Table IV of his paper, he uses a similar test in which he adds in current teachers as an additional set of control 
variables in the model estimating future teacher effects. As shown in our simulations, this second test will also often 
reject when there is no bias. 

21 As shown earlier, footnote 13 of Rothstein’s paper uses the same grade levels — g, g-1, and g-2. 
The numerator in $R_{1g}$ is the covariance between current teachers and the residual from a 
regression of double-lagged achievement on lagged achievement. 22 

(5) $A_{i(g-2)} = \lambda_2 A_{i(g-1)} + u_{i(g-2)}$ 

If $R_{1g}$ is 0, then $\lambda_2$ equals $\lambda_1$ and $u_{i(g-2)}$ equals $w_{i(g-2)}$. Thus, one can test whether $R_{1g}$ differs from 0 
using the following covariance: 

$\mathrm{cov}(\tau_{1ig}, u_{i(g-2)} \mid A_{i(g-1)}) \neq 0$ 23 

For comparison, here is the condition for obtaining biased estimates: 

$\mathrm{cov}(\tau_{1ig}, e_{ig} \mid A_{i(g-1)}) \neq 0$ 

Formulated in this way, the Rothstein falsification test and formal condition for bias appear 
quite similar but differ in a key respect: $e_{ig}$ is not the same as $u_{i(g-2)}$. One can, therefore, generate data 
in which there is no bias but the Rothstein test rejects (and vice versa). 

We break up the error term from equation 5 into three pieces to show factors that might cause 
the Rothstein test to falsify unbiased VAMs. Rearranging and substituting out for $A_{i(g-1)}$ yields: 

$u_{i(g-2)} = A_{i(g-2)} - \lambda_2 (\lambda A_{i(g-2)} + \beta_{1(g-1)} \tau_{1i(g-1)} + e_{i(g-1)})$ 

(6) $= (1 - \lambda_2 \lambda) A_{i(g-2)} - \lambda_2 \beta_{1(g-1)} \tau_{1i(g-1)} - \lambda_2 e_{i(g-1)}$ 24 

Equation 6 shows that $u_{i(g-2)}$ can be written as a linear function of three variables: $A_{i(g-2)}$, $\tau_{1i(g-1)}$, 
and $e_{i(g-1)}$. The Rothstein test is based on testing whether $\mathrm{cov}(\tau_{1ig}, u_{i(g-2)} \mid A_{i(g-1)}) \neq 0$, which means it 
will reject if any of the following three conditions hold: 

(C1) $\mathrm{cov}(\tau_{1ig}, A_{i(g-2)} \mid A_{i(g-1)}) \neq 0$ 

(C2) $\mathrm{cov}(\tau_{1ig}, \tau_{1i(g-1)} \mid A_{i(g-1)}) \neq 0$ 

(C3) $\mathrm{cov}(\tau_{1ig}, e_{i(g-1)} \mid A_{i(g-1)}) \neq 0$ 
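
The step from equation 6 to the three conditions can be written out in one line of algebra. The sketch below assumes only that the partial covariance defined in footnote 23 is linear in its second argument, which holds for residuals from a linear regression. The notation follows equations 3 through 6.

```latex
\begin{align*}
\mathrm{cov}(\tau_{1ig}, u_{i(g-2)} \mid A_{i(g-1)})
  &= (1 - \lambda_2 \lambda)\,\mathrm{cov}(\tau_{1ig}, A_{i(g-2)} \mid A_{i(g-1)}) \\
  &\quad - \lambda_2 \beta_{1(g-1)}\,\mathrm{cov}(\tau_{1ig}, \tau_{1i(g-1)} \mid A_{i(g-1)}) \\
  &\quad - \lambda_2\,\mathrm{cov}(\tau_{1ig}, e_{i(g-1)} \mid A_{i(g-1)})
\end{align*}
```

Each term on the right corresponds to one of conditions C1 through C3. Strictly speaking, the left-hand side is nonzero whenever the weighted terms do not cancel exactly, so a single nonzero condition is generally enough for the test to reject.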

E. Exploring the Conditions That Cause the Rothstein Test to Falsify 

The first condition that would cause the Rothstein test to reject is that current teachers are 
conditionally correlated with double-lagged achievement, after controlling for lagged achievement in 
a linear regression. This is not implausible, as school systems may not have ready access to lagged 


22 More precisely, $R_{1g} = \mathrm{cov}(\tau_{1ig}, A_{i(g-2)} \mid A_{i(g-1)}) / \mathrm{var}(\tau_{1ig} \mid A_{i(g-1)}) = \mathrm{cov}(\tau_{1ig}, u_{i(g-2)} \mid A_{i(g-1)}) / \mathrm{var}(\tau_{1ig} \mid A_{i(g-1)})$, since $u_{i(g-2)}$ 
is the residual that remains after regressing $A_{i(g-2)}$ on $A_{i(g-1)}$. 

23 Throughout, we use $\mathrm{cov}(x, y \mid A_{i(g-1)})$ to describe the covariance that remains between variables x and y after 
controlling for $A_{i(g-1)}$ using a linear regression. 

24 This equation is similar to Rothstein’s equation 7 except that it includes double-lagged achievement on the 
right-hand side. 
achievement scores at the point at which teacher assignment decisions are made. 25 Consequently, 
many schools may use double-lagged achievement for tracking decisions. 

The second condition that would cause the Rothstein test to reject occurs when current 
teachers are conditionally correlated with lagged teachers, after controlling for lagged achievement. 
This is likely if some classrooms disproportionately consist of students who shared the same 
classroom in the previous year, perhaps because schools intentionally keep certain students together 
(or apart). We refer to this as “classroom tracking.” 26 

Conditions 1 and 2 both relate to the possibility that there is a variable left out of the VAM that 
affects tracking and that might, therefore, cause bias. However, if these variables (double-lagged 
achievement and lagged teachers) affect current achievement only through their impacts on lagged 
achievement, as is implied by equation 3, then they may cause the Rothstein test to reject, but their 
omission from the VAM will cause no bias. 
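
Condition 1 can be illustrated with a small simulation. The sketch below is not the simulation reported in Section III. The function name `double_lag_tracking`, the sample size, and the noise scale in the tracking rule are assumptions. Students are tracked to the current teacher using double-lagged achievement, but double-lagged achievement affects current achievement only through lagged achievement, as in equation 3.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def double_lag_tracking(n=20000, lam=0.9, beta=0.1, seed=2):
    rng = np.random.default_rng(seed)
    a_gm2 = rng.normal(size=n)                                 # A_{i(g-2)}
    a_gm1 = lam * a_gm2 + rng.normal(scale=0.4, size=n)        # A_{i(g-1)}
    # Tracking uses double-lagged achievement plus noise (assumed rule).
    tau = (a_gm2 + rng.normal(scale=0.5, size=n) > 0).astype(float)
    a_g = lam * a_gm1 + beta * tau + rng.normal(scale=0.4, size=n)

    X = np.column_stack([np.ones(n), a_gm1, tau])
    beta_hat = ols(X, a_g)[2]    # VAM estimate of the current teacher effect
    r_hat = ols(X, a_gm2)[2]     # Rothstein coefficient R_{1g} from equation 4
    return beta_hat, r_hat

beta_hat, r_hat = double_lag_tracking()
print("estimated teacher effect (true value 0.1):", round(float(beta_hat), 3))
print("Rothstein coefficient R_1g:               ", round(float(r_hat), 3))
```

In this setup the Rothstein coefficient is far from zero, so the test would reject, while the estimated teacher effect stays close to its true value.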

The third condition that would cause the Rothstein test to reject occurs when the current 
teacher is conditionally correlated with the lagged error term, after controlling for lagged 
achievement. This condition might seem unlikely to matter since $e_{i(g-1)}$ is part of $A_{i(g-1)}$. Thus, one 
might assume that controlling for $A_{i(g-1)}$ would account for a correlation between $e_{i(g-1)}$ and $\tau_{1ig}$. 
However, this turns out not to be the case because the falsification test is linear while both the 
current teacher and double-lagged achievement are nonlinear functions of lagged achievement and 
the nonlinearities can be correlated. 27 Consequently, the linearly based falsification test can suggest 
implausible current teacher impacts on double-lagged achievement in an unbiased VAM. We 
describe this issue in detail in Appendix A and show evidence of this in our simulations. 
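
The role of nonlinearity can also be sketched in a few lines. The example below is an illustration under strong assumptions, not the Appendix A derivation: the function name `nonlinear_tracking` is hypothetical, tracking is a simple threshold on lagged achievement, and the lagged teacher effect is set to 1.0, far larger than the 0.1 standard deviation used elsewhere in this paper, purely so that the nonlinearity is visible in a single simulated sample. Assignment to the current teacher depends on lagged achievement alone, so the linear VAM is correctly specified.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def nonlinear_tracking(n=200000, lam=0.9, lagged_teacher_effect=1.0,
                       beta=0.1, seed=3):
    rng = np.random.default_rng(seed)
    a_gm2 = rng.normal(size=n)                          # A_{i(g-2)}
    tau_lag = rng.integers(0, 2, n).astype(float)       # lagged teacher, random
    e_lag = rng.normal(scale=0.4, size=n)               # e_{i(g-1)}
    a_gm1 = lam * a_gm2 + lagged_teacher_effect * tau_lag + e_lag
    # Current teacher is a step function of lagged achievement only.
    tau = (a_gm1 > np.median(a_gm1)).astype(float)
    a_g = lam * a_gm1 + beta * tau + rng.normal(scale=0.4, size=n)

    X = np.column_stack([np.ones(n), a_gm1, tau])
    beta_hat = ols(X, a_g)[2]    # VAM estimate of the current teacher effect
    r_hat = ols(X, a_gm2)[2]     # Rothstein coefficient R_{1g}
    return beta_hat, r_hat

beta_hat, r_hat = nonlinear_tracking()
print("estimated teacher effect (true value 0.1):", round(float(beta_hat), 3))
print("Rothstein coefficient R_1g:               ", round(float(r_hat), 3))
```

Under these assumptions the Rothstein coefficient is negative and clearly different from zero, even though the teacher effect is estimated without bias. With realistic teacher effect sizes the coefficient would be much smaller, so detecting it depends on the sample size and the number of teachers.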


25 For instance, Mathematica does value-added work in various states and localities where teacher effect estimates 
are needed in a timely way to inform key policy decisions. In many of these locations, state achievement score data from 
the spring of one school year are often not available until the fall of the following school year, too late to affect tracking 
decisions for that year. See, for example, Potamites et al. (2009) and Chaplin et al. (2009). 

26 There are alternative tests for bias caused by these types of tracking. For example, for the first condition, one can 
include double-lagged achievement in the VAM model and test to see if the estimated coefficient estimates on current 
teachers change compared to a model without that variable (Rothstein 2009). Similarly, for the second condition, one 
can add lagged teachers to a standard VAM. Results of this latter test are likely to be very imprecise for many teachers, 
especially in smaller schools, if most of their students come from a single lagged teacher. 

27 Current teachers are a nonlinear function of lagged achievement because the tracking equation is bounded 
between 0 and 1. Double-lagged achievement can be described using a nonlinear function of lagged achievement 
because the lagged teachers create discontinuous jumps in lagged achievement that are not in double-lagged 
achievement. The two sources of nonlinearity can be correlated because both depend on lagged achievement. This result 
suggests that the Rothstein falsification test may not work well for VAM. This does not mean that falsification tests 
would not work in other situations. For example, Heckman and Hotz (1989) propose a general falsification test for 
nonexperimental estimators based on the same underlying concept as Rothstein: that a treatment cannot affect past 
outcomes. They were looking at the impacts of a job training program. Rothstein (2010) cites Ashenfelter (1978) on this 
issue. Ashenfelter was also looking at impacts of a job training program. Job training programs are often taken only once 
every few years. Hence, there may be no analogy to the lagged teacher that would generate nonlinearities in the 
relationship between lagged earnings and double-lagged earnings that are correlated with the current training program, 
like the nonlinearities in the relationship between lagged achievement and double-lagged achievement that are likely in 
VAM. Hence, while our results suggest concern regarding the use of falsification tests for a VAM, they do not rule out 
the use of falsification tests in other situations, such as those considered by Heckman and Hotz (1989), and even in 
other education research in which the goal is to estimate effects of programs or policies that target large numbers of 
classrooms. 
This third condition is important because Rothstein (2010) expresses particular interest in the 
distribution of error terms. In particular, he acknowledges that evidence that future teachers are 
statistically significant predictors of past achievement is not itself proof of bias for current teachers. 28 
He goes on to say that tracking accompanied by negative correlation in the errors across grades (that 
is, $\mathrm{cov}(e_{ig}, e_{i(g-1)}) < 0$) “strongly suggests” bias for current grade teachers. More precisely he says, 

“A correlation between treatment and some pre-assignment variable X need not indicate 
bias in the estimated treatment effect if X is uncorrelated with the outcome variable of 
interest. But outcomes are typically correlated within individuals over time, so an 
association between treatment and the lagged outcome strongly suggests that the treatment 
is not exogenous with respect to post-treatment outcomes (Rothstein 2010).” 

We agree that, if the errors are correlated (either negatively or positively) and there is tracking, 
then the teacher effects will probably be biased (see Appendix B for more detail on this point) and 
the Rothstein falsification test will reject based on condition 3 (see Appendix A). However, as we 
show, the test will also reject based on conditions 1, 2, or 3 even without negatively correlated errors. Thus, 
one cannot use the test to definitively identify bias caused by negatively correlated errors. Similarly, 
one also cannot use the Rothstein test to definitively identify variables left out of the VAM that 
might cause bias (such as double-lagged achievement or lagged teachers) since the test will reject 
based on condition 3 even without conditions 1 or 2. 


28 More generally, in personal correspondence with others (see Chetty et al. 2011, footnote 53), Rothstein has 
stated that his test is “neither necessary nor sufficient for there to be bias in a VA estimate.” Rather, his test suggests 
cause for concern about bias that might be caused by unobservables. We view our findings as showing conditions under 
which that bias might also be small. Chetty et al. (2011) present nonexperimental evidence suggesting small bias caused 
by unobservables, supporting earlier experimental findings by Kane and Staiger (2008). 
III. SIMULATION RESULTS 

We performed a number of simulations that illustrate the findings reported in the preceding 
section. For consistency with Rothstein (2010), we simulated data for grades 3 through 5. 29 We 
conducted Rothstein falsification tests (described by equation 4 above) and tests for bias for the 
estimated grade 5 teacher effects. 30 The errors in the achievement equations are jointly normally 
distributed and have standard deviations of 0.4. We set the coefficient on lagged achievement to 
either 0.91 or 0.95, depending on the model, to keep the achievement level standard deviations close 
to 1. 31 The standard deviation of grade 5 teacher effects was set to 0.1. This is the value Rothstein used 
for his baseline model in Appendix C of his 2010 paper and is in the range of the estimates reported 
by Hanushek and Rivkin (2010). The standard deviations of grade 3 and 4 teacher effects were set to 
either 0.1 or 0, depending on the model. 
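
The role of the two coefficient values can be seen from a rough variance accounting. The sketch below rests on simplifying assumptions that are ours rather than stated in the text: achievement in grade $g$ equals $\lambda$ times lagged achievement plus a teacher effect (standard deviation $\sigma_\beta$) plus an error (standard deviation $\sigma_e = 0.4$), teacher effects are uncorrelated with lagged achievement and the errors, lagged achievement has variance 1, and errors are correlated only across adjacent grades with correlation $\rho$.

```latex
\begin{align*}
\operatorname{Var}(A_{ig}) &= \lambda^2 \operatorname{Var}(A_{i(g-1)}) + \sigma_\beta^2 + \sigma_e^2
  + 2\lambda \operatorname{cov}(A_{i(g-1)}, e_{ig}),
  \qquad \operatorname{cov}(A_{i(g-1)}, e_{ig}) = \rho\,\sigma_e^2 \\
\lambda = 0.91,\ \rho = 0,\ \sigma_\beta = 0.1 &: \quad 0.8281 + 0.01 + 0.16 = 0.9981 \\
\lambda = 0.95,\ \rho = -0.25,\ \sigma_\beta = 0.1 &: \quad 0.9025 + 0.01 + 0.16 - 0.076 = 0.9965
\end{align*}
```

Under these assumptions both settings imply a standard deviation of about 1.00, which is in line with the range reported in footnote 31. Tracking would add a covariance between teacher effects and lagged achievement that this sketch ignores.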

We simulated data for 200 schools with four teachers per school and 20 students per teacher for 
a total of 800 teachers and 16,000 students in each model. 

We simulated data setting the school effects to zero but controlled for school effects both in 
our VAMs and in the Rothstein tests. Thus, we estimated effects for three teachers in each school, 
or a total of 600 teachers across the sample. 

We based tracking on various subsets of the following five factors: (1) previous achievement, 
(2) double-lagged achievement, (3) the previous teacher (a dummy variable), (4) a random 
component that has a standard deviation of 0.2, and (5) an omitted variable (discussed below) with a 
standard deviation of 0.2. Within schools, we split students into four groups of 20 each based on an 
indicator variable equal to the sum of the tracking factors used. The factors used vary depending on 
the model, as described in Table 1. In some models, the achievement error terms have a negative 
correlation of around -0.25. 32 
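
As a minimal sketch of the grouping step, the code below ranks the students of one school on a tracking index and cuts the ranking into four classrooms of 20. The function name `assign_classrooms` and the rank-and-cut rule are assumptions made for illustration; the text specifies only that groups are formed from the sum of the tracking factors.

```python
import random

def assign_classrooms(tracking_index, n_classes=4, class_size=20):
    """Split one school's students into classrooms by rank on the tracking index."""
    if len(tracking_index) != n_classes * class_size:
        raise ValueError("expected n_classes * class_size students")
    # Student positions ordered from the lowest to the highest index value.
    order = sorted(range(len(tracking_index)), key=lambda i: tracking_index[i])
    classroom = [0] * len(tracking_index)
    for rank, student in enumerate(order):
        # Ranks 0-19 go to classroom 0, ranks 20-39 to classroom 1, and so on.
        classroom[student] = rank // class_size
    return classroom

rng = random.Random(0)
# One hypothetical school: 80 students tracked on lagged achievement (factor 1)
# plus the random component with standard deviation 0.2 (factor 4).
lagged_achievement = [rng.gauss(0.0, 1.0) for _ in range(80)]
random_component = [rng.gauss(0.0, 0.2) for _ in range(80)]
tracking_index = [a + r for a, r in zip(lagged_achievement, random_component)]
classroom = assign_classrooms(tracking_index)
```

Other tracking factors would simply be added to `tracking_index` before the call.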


Rothstein (2010) finds almost no correlation between errors that are two periods apart. Rather, he finds 
correlations only between errors in contiguous periods. To be consistent with his evidence, we generated 
errors with these properties (correlations between contiguous periods but no correlation across 
non-contiguous periods). More precisely, we generated errors using the following formula: 


$e_{ig} = w_1 u_{ig} + w_2 u_{i(g-1)}$ 


29 We started with normally distributed achievement in grade 2 and then added in teacher effects and normally 
distributed errors for achievement in grades 3, 4, and 5. 

30 The grade 5 teachers can be thought of as future teachers in grade 4 using the regular Rothstein test or current 
teachers using our revised test (lagging the Rothstein test one period). As noted earlier, if tracking systems are stable 
across grades, then the choice of grade levels will not matter. 

31 The choice of the coefficient on lagged achievement does not impact our substantive findings as long as it is not 
zero. The achievement scores all have standard deviations between 0.98 and 1.01. 

32 Rothstein simulates data in Appendix C of his paper using a negative correlation of -0.25. He reports correlations 
of -.21 for math and -.19 for reading for the residuals from the VAM based on the North Carolina data he analyzes. 
These estimates are not adjusted for measurement error. He does report that the correlations are too large to be caused 
by measurement error for his “VAM1” model, which assumes a coefficient of one on lagged achievement. 
where $w_1$ and $w_2$ are weights chosen to generate errors with the specified variances and correlations 
across grades. The $u_{ig}$ variables are uncorrelated across grades, so errors separated by two or 
more periods are also uncorrelated. 
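
One way to pick such weights is sketched below, assuming the $u$ variables have unit variance, so that the weights must satisfy $w_1^2 + w_2^2 = \sigma_e^2$ and $w_1 w_2 = \rho\,\sigma_e^2$. The function names are hypothetical, and this closed-form solution is an illustration, not necessarily the procedure used for the simulations.

```python
import math
import random

def ma1_weights(sd, rho):
    """Solve w1**2 + w2**2 = sd**2 and w1 * w2 = rho * sd**2 for unit-variance u."""
    if abs(rho) > 0.5:
        raise ValueError("errors of this form cannot have |rho| above 0.5")
    total = math.sqrt(sd ** 2 * (1 + 2 * rho))   # equals w1 + w2
    gap = math.sqrt(sd ** 2 * (1 - 2 * rho))     # equals w1 - w2
    return (total + gap) / 2, (total - gap) / 2

def simulate_errors(n_students, n_grades, w1, w2, seed=0):
    """Return one list of grade-by-grade errors per student."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_students):
        # One extra draw so the first grade also has a lagged component.
        u = [rng.gauss(0.0, 1.0) for _ in range(n_grades + 1)]
        errors.append([w1 * u[g + 1] + w2 * u[g] for g in range(n_grades)])
    return errors

def sample_corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

w1, w2 = ma1_weights(sd=0.4, rho=-0.25)
errors = simulate_errors(n_students=16000, n_grades=3, w1=w1, w2=w2)
grade3, grade4, grade5 = (list(col) for col in zip(*errors))
adjacent = sample_corr(grade3, grade4)    # contiguous grades: near -0.25
two_apart = sample_corr(grade3, grade5)   # non-contiguous grades: near 0
```

Because each error shares a $u$ draw only with its immediate neighbor, errors two grades apart are uncorrelated by construction.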

As noted above, some models include an omitted variable that impacts tracking decisions. That 
variable also impacts current achievement scores. It is not correlated with lagged achievement. 

For each model, we tested to see if the model was rejected using the Rothstein falsification test 
and also if the impact estimates for grade 5 teachers were biased. 33 To test for bias, we did a joint 
test of the difference of each of the teacher effect estimates from their true values, which were used 
to simulate the data (allowing for correlations across the estimates). We describe the magnitude of 
the estimated future teacher effects from the Rothstein test using their standard deviations. We show 
correlations between the estimated teacher effects and the true effects as a way of assessing the 
magnitude of the bias. 

We present five sets of results in Table 1. The first set of columns (under “Results by 
Condition”) demonstrates how the Rothstein test performed based on the three conditions 
discussed above. The second set of columns (under “Linear Falsification Test”) covers findings 
when grade 4 achievement was linearly associated with grade 3 achievement; these results are useful 
to show why the Rothstein test rejects based on the third condition described above. The third set 
of columns (under “Negatively Correlated Errors”) describes conditions under which the Rothstein 
test worked in the sense that it falsified the model when the VAM produced biased teacher effect 
estimates. The fourth set of columns (under “Failing to Falsify”) presents cases in which the 
Rothstein test failed to falsify when it should have done so. 34 The last column (under “Rejecting 
RA”) shows that the Rothstein test can reject in spite of random assignment (RA) of teachers to 
classrooms. 

The first four rows show the parameters used to generate the data for each model. Blanks 
indicate zero values. The last six rows of Table 1 show our results. The bias test is a joint test of the 
difference between the estimated teacher effects and the true teacher effects from equation 1 above, 
accounting for the fact that one teacher in each school was dropped. The Rothstein test is a joint 
test of the significance of current teachers for predicting double-lagged achievement controlling for 
lagged achievement (like equation 4 above but with multiple teachers and with school effects). The 
standard deviation of the VAM estimates comes from the VAM results. The standard deviation of 
the Rothstein estimates comes from the coefficients on the teacher dummies produced in the 
Rothstein test. Raw cor($\hat{\beta}_t$, $\beta_t$) is the raw correlation between the estimated and true teacher effects, 
and Adjusted cor($\hat{\beta}_t$, $\beta_t$) is the correlation adjusted for estimation error. 


33 To estimate the standard deviation of current teacher effect estimates, we use estimates based on equation 1. 

34 This is a point he acknowledges is possible in his paper. 
Table 1. Simulation Results for Rothstein Falsification Test and Bias Test, by Model 

| | Results by Condition | | | Linear Falsification Test | | | Negatively Correlated Errors | | | Failing to Falsify | | Rejecting RA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Parameter / Result | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Teachers Tracked on | $A_4$, $A_3$ | $A_4$, $T_4$ | $A_4$ | $A_4$, $A_3$ | $A_4$, $T_4$ | $A_4$ | Rand | $A_4$ | $A_4$ | $ov^*_{ig}$ | $ov^*_{ig}$, $A_4$ | $T_4$ |
| Cor($e_{ig}$, $e_{i(g-1)}$) | | | | | | | -0.25 | -0.25 | -0.25 | | -0.25 | |
| Var($ov^*_{ig}$) | | | | | | | | | | 0.2 | 0.2 | |
| $\beta_3$, $\beta_4$ | 0.1 | 0.1 | 0.1 | | | | 0.1 | 0.1 | | 0.1 | | 0.1 |
| Bias Test | 0.95 | 0.95 | 0.98 | 1.01 | 0.96 | 0.92 | 1.01 | 1.00 | 0.88 | | | 1.02 |
| Rothstein Test | 8.34 | 8.91 | 1.27 | 7.11 | 7.97 | 1.03 | 0.99 | 1.18 | 0.96 | 1.00 | 0.97 | 2.01 |
| Std Dev VAM Estimates | 0.132 | 0.137 | 0.137 | 0.132 | 0.138 | 0.129 | 0.138 | 0.135 | 0.133 | 0.189 | 0.230 | 0.133 |
| Std Dev Rothstein Estimates | 0.331 | 0.292 | 0.149 | 0.311 | 0.278 | 0.130 | 0.129 | 0.144 | 0.125 | 0.133 | 0.126 | 0.174 |
| Raw cor($\hat{\beta}_t$, $\beta_t$) | 0.71 | 0.76 | 0.75 | 0.71 | 0.74 | 0.71 | 0.73 | 0.73 | 0.75 | 0.52 | 0.42 | 0.71 |
| Adjusted cor($\hat{\beta}_t$, $\beta_t$) | 0.99 | 1.00 | 1.00 | 0.98 | 0.98 | 0.99 | 0.97 | 0.99 | 1.00 | 0.60 | 0.47 | 0.97 |

Notes: Blanks indicate zeros. Tracking is based on the sum of the variables indicated in the table and a random error with standard deviation of 0.2. 
Rand means this random error was the only one used for tracking. School effects are set to zero. Achievement has a standard deviation close to 
one. The errors $e_{i3}$, $e_{i4}$, and $e_{i5}$ are in the achievement equations and have standard deviations of 0.4. The variable $ov^*_{ig}$ affects both 
achievement and tracking in grade 5, but is uncorrelated with previous grade variables. Lambda is 0.91 in all models except for 7, 8, 9, and 11, 
where it is 0.95. The bias test is a joint test of the difference between the estimated and true teacher effects in a standard VAM. The Rothstein 
test is described in the text. The test statistics are F-tests. The cut-point for 5 percent statistical significance given our sample sizes is 1.11. 
Test statistics in bold are significant at the 5 percent level. The standard deviation of grade 5 teacher effects is 0.1 in all models. Each model has 
200 schools, four teachers per school, and 20 students per teacher. The regressions control for school effects so we only estimate teacher 
effects for 600 teachers in each model (three teachers per school and 200 schools). 

Adjusted cor($\hat{\beta}_t$, $\beta_t$) is an estimate of the correlation that would be observed between $\hat{\beta}_t$ and $\beta_t$ in the absence of any estimation error (Spearman 
1904; Goldhaber and Hansen 2010b). 
The first three columns show cases in which the Rothstein test rejected but there was no bias, 
relating them to the three conditions discussed above. For the first two conditions, the reason for 
the Rothstein test to reject seems fairly clear: current teachers were selected in part based on 
double-lagged achievement and/or lagged teachers, both of which are components of the error term 
from the Rothstein test. Hence, current teachers are correlated with that error. In the third column, 
however, neither condition was present and yet the Rothstein test still rejected. Here the false 
rejection was caused by the fact that the bivariate relationship between double-lagged achievement 
and lagged achievement is nonlinear, whereas the Rothstein falsification test is linear, as discussed 
earlier. The results in columns 4 through 9 of Table 1 help to illustrate this, and Appendix A explains 
in more detail why this is possible. 

One of Rothstein’s findings is that the magnitudes of the future teacher effects are quite large. 
We found this result in models without bias. Indeed, even in column 3, when rejection is due only to 
the nonlinearity issue, the estimated future teacher effects from the Rothstein test are still about the 
same size as the estimated current teacher effects from the VAM. 

The fact that the grade 5 future teacher effects are noticeable for conditions 1 and 2 might not 
seem surprising, given that those conditions imply that variables that affected tracking were left out 
of the VAM equation. The fact that the nonlinearities cause such a large future teacher coefficient 
estimate for condition 3 might seem more surprising. One reason for this is that the denominator of 
the coefficient estimate on the grade 5 teacher is the variance in the grade 5 teacher that remains 
after controlling for lagged achievement. If lagged achievement predicts the current teacher well in a 
linear model, then the denominator (residual variance) may be small, resulting in a relatively large 
grade 5 teacher coefficient estimate. This means that the “future teacher effects” produced by the 
Rothstein test have magnitudes similar to true teacher effects in models with little or no bias. Thus, 
their magnitudes could be misleading in regard to the magnitude of the bias identified. 
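
The denominator argument can be made explicit with a standard partialling-out identity. This is a sketch in the notation of Appendix A, where $\tau^*_{1ig}$ and $A^*_{i(g-2)}$ denote the residuals from linear regressions of the grade 5 teacher indicator and double-lagged achievement on lagged achievement; the symbol $r$, the correlation between the teacher indicator and lagged achievement, is introduced here only for illustration.

```latex
\[
R_{1g} = \frac{\operatorname{cov}(\tau^*_{1ig}, A^*_{i(g-2)})}{\operatorname{var}(\tau^*_{1ig})},
\qquad
\operatorname{var}(\tau^*_{1ig}) = \operatorname{var}(\tau_{1ig})\,(1 - r^2).
\]
```

As $r^2$ approaches 1, the denominator shrinks, so even a small residual covariance in the numerator can translate into a sizable coefficient.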

The results presented in the columns under “Linear Falsification Test” are based on data in 
which the baseline (grade 4) scores were linearly related to the grade 3 scores. We did this by setting 
the grade 3 and 4 teacher effects to zero. As explained in Appendix B, this enables the grade 3 and 4 
achievement scores to be linearly related. 

As illustrated in columns 4 and 5 in Table 1 (under “Linear Falsification Test”), the Rothstein 
test still rejects if tracking is based on either grade 3 achievement or grade 4 teachers. 36 However, if 
neither of those conditions holds, then the Rothstein test no longer rejects, as can be seen in column 
6. This is important because it highlights the fact that the Rothstein test could be used to provide 
evidence that grade 5 students were tracked on either grade 3 (double-lagged) achievement or grade 
4 (lagged) teachers (conditions 1 and 2) or something correlated with those variables even after 
controlling for grade 4 achievement, if the grade 3 and 4 achievement levels were linearly related. 


35 In Table 1, we present results generated using the Rothstein test described in equation 4 above. We also ran 
another variation of the Rothstein test for all of the models presented in Table 1, which we call the Rothstein 2 test. It 
was also used in his paper. This involved adding grade 4 teachers to equation 4. For the first three columns of Table 1, 
the results were similar in the sense that the standard deviations of the estimated future teacher effects were always larger 
than those for the VAM estimates for current teachers. 

36 The standard deviations of the future teacher effects based on the Rothstein 2 test are larger than the standard 
deviations of the VAM estimates for these models. 
In columns 7, 8, and 9 (under “Negatively Correlated Errors”), we present conditions under 
which the Rothstein test did identify bias. In column 7, we generated data with no tracking. 
Therefore there was no bias and the Rothstein test did not reject. This is in spite of the fact that the 
model had negatively correlated errors and lagged teacher effects so that the grade 3 and 4 test 
scores were no longer linearly related. When there was tracking (as in column 8), there was bias and 
the Rothstein test appropriately rejected. However, the amount of bias may have little policy 
relevance as the correlation between the estimated and true teacher effects was close to 1 after 
adjusting for estimation error. 37 In addition, we did not find statistically significant bias here, though 
this was due only to a lack of precision. 38 

The results in column 9 help to illustrate why the model used in column 8 resulted in so little 
bias. In particular, in column 9, we show that if the baseline scores and double-lagged scores were 
linearly related (that is, there were no lagged teacher effects), then there would be no bias. Under 
those conditions, the baseline test can control for the negatively correlated errors (as discussed in 
Appendix B). 39 While it is not likely that baseline and double-lagged scores are linearly related, they 
may be almost linearly related given the small magnitude of the lag teacher effects relative to the 
overall variance of achievement. 

If we knew that the error terms were negatively correlated but were unsure if there was tracking, 
then the Rothstein test would provide evidence of at least some bias. However, given that tracking in 
schools is quite likely, evidence of negatively correlated errors is itself evidence of bias. And, as noted earlier, one 
cannot use the Rothstein test to check for bias caused by negatively correlated errors because it is 
not possible to determine whether the test is rejecting because of the negative correlation or because 
of one of the other conditions specified above. 

In columns 10 and 11 (under “Failing to Falsify”), we present cases in which the Rothstein test 
failed to falsify, but the estimated teacher effects were actually biased. We obtained these results by 
creating an omitted variable that affected both grade 5 achievement and the selection of the grade 5 
teachers, but was uncorrelated with lagged achievement and lagged teachers. Because the variable 
was uncorrelated with lagged achievement, it did not cause the Rothstein test to reject. Column 10 
presents results with uncorrelated errors. In column 11, we show estimates for a model with 
negatively correlated errors that also did not cause the Rothstein test to reject. 40 

Random assignment of individual students to teachers is a clear case in which VAMs yield 
unbiased estimated teacher effects, and the Rothstein test appropriately fails to falsify. Interestingly, 


37 We also ran models with much larger teacher effects (standard deviation of 0.75), larger negative correlations 
(-0.90), and both. The adjusted correlations between the estimated and true teacher effects remained at 0.93 and above in 
these models. 

38 With larger sample sizes we do find bias. Also in results available upon request we simulated data for a model 
very similar to the “baseline” model in Appendix C of the Rothstein paper. We find biased estimates for this model 
which has negatively correlated errors and lag teacher effects. The correlations between the estimated and true teacher 
effect estimates remain well over 0.90 after adjusting for estimation error. This model includes 1.2 million teachers and 
measurement error, as well as teacher and school effects that are correlated across grades. 

39 In columns 7, 8, and 9, the Rothstein 2 test produces future teacher effect estimates with standard deviations 
that are all larger than those presented in Table 1 for the regular Rothstein test. 

40 The standard deviations for the Rothstein 2 teacher effects are similar to those for the Rothstein test presented 
in Table 1. 
however, random assignment of groups of students to teachers can cause the Rothstein test to falsify 
if students were tracked into those groups. This is shown in the last column of Table 1, which 
reports findings generated when students were tracked into classrooms based on their previous 
classrooms alone, but teacher assignment to classrooms was random. 41 The same arguments for the 
Rothstein test rejecting hold as for the previous models. Indeed, the first three columns of Table 1 
are relevant because they simply describe how students were tracked into grade 5 classrooms, but 
not how grade 5 teachers were assigned to those classrooms. 42 Thus, they would hold if teachers 
were randomly assigned. 


41 The standard deviation of future teacher effects from the Rothstein 2 test was only 0.075 in this case. 

42 In results available upon request, we ran simulations that align with each of those presented in Table 1, but using 
50,000 students per teacher and only two teachers and one school each. The results were generally similar to those in 
Table 1. An important exception is that we did find bias for column 8 (as expected) in that model. 
IV. CONCLUSION 

As we noted at the outset of this paper, Rothstein’s critique of value-added methods used to 
estimate teacher effectiveness has been cited by both research and policymaking communities as a 
reason to doubt the wisdom of using VAMs for high-stakes purposes. The findings we present here, 
however, call into question whether the Rothstein falsification approach provides accurate guidance 
regarding the bias of teacher effect estimates. 

Ideally, the Rothstein test could be used to identify VAMs that produce biased estimates of 
current teacher effects. Our results suggest this is not possible. Moreover, we find that one cannot 
use the Rothstein test to reject the hypothesis that students were effectively randomly assigned 
conditional on lagged achievement. 

We would argue that Rothstein’s 2010 paper raised important concerns about the ability of 
VAMs to produce unbiased estimates of teacher effectiveness, but the Rothstein test itself (just one 
part of his paper) does not provide useful guidance regarding VAMs. Given this, we believe that 
more work needs to be done to understand the potential for bias in VAMs. This will likely involve 
more investigation into the factors that might cause such bias. One way to approach this topic is to 
look at the various factors affecting student sorting into classrooms other than those typically 
included as controls in VAMs so that such variables could be added to future models. 43 Work like 
this has been done by numerous authors both quantitatively (Jacob and Lefgren 2007) and 
qualitatively (Kraemer et al. 2011). When doing such work, however, researchers may want to keep 
in mind results of Rothstein (2009, 2010) that suggest that some variables often omitted from VAMs 
(for example double-lagged achievement) may affect the selection of current teachers and yet not 
cause much bias. 

From a policy perspective, the important questions may not be whether there is any bias, but 
the magnitude of the bias. It is quite likely that teacher effectiveness estimates generated from VAMs 
are biased to some degree but, as shown in Rothstein (2009), Kinsler (2011), and our simulations, 
the magnitude of bias may be relatively inconsequential. Decisions about using VAMs should 
consider how this bias compares to potential information that value-added models can provide 
about teacher effectiveness over, or in addition to, other means of assessment. 44 


43 Ashenfelter (1978) makes a similar point in his paper, which looks at similar issues outside of value-added 
models. 

44 Value-added estimates may also be very imprecise (Schochet and Chiang 2010). Other measures of teacher 
effectiveness may also be imprecise, so the policy focus should probably be on how best to obtain more precise 
estimates of teacher performance. This may involve using some combination of VAM and non-VAM measures. Indeed, 
if well implemented such combinations may be optimal both for reducing bias and for improving precision. 
APPENDIX A 

WHEN FALSIFICATION TESTS FAIL 

In this appendix, we elaborate in more detail why the Rothstein test can incorrectly falsify 
correctly specified VAMs. As discussed in the main body of this paper, the omission of variables 
used for tracking students need not cause bias if they do not impact current test scores directly, 
although the existence of such omitted variables in combination with negatively correlated error 
terms would suggest bias. Conditions 1 and 2 suggest that the Rothstein test could be used to help 
identify the existence of such omitted variables — in particular, double-lagged achievement and 
lagged teachers. Here, however, we illustrate that the Rothstein test cannot be used to rule out the 
possibility that students were tracked randomly conditional on lagged achievement. To do this, we 
use a one-school, two-teacher example in which tracking decisions depend only on lagged 
achievement and a random error, and there are no omitted variables from the model. We start with 
an equation for $R_{1g}$, the coefficient on the better (that is, more effective) “future” teacher from the 
Rothstein test (see footnote 22). 

$R_{1g} = \mathrm{cov}(\tau_{1ig}, A_{i(g-2)} \mid A_{i(g-1)}) \,/\, \mathrm{var}(\tau_{1ig} \mid A_{i(g-1)}) = \mathrm{cov}(\tau^*_{1ig}, A^*_{i(g-2)}) \,/\, \mathrm{var}(\tau^*_{1ig})$ 

where $A^*_{i(g-2)}$ and $\tau^*_{1ig}$ are the residuals that result from regressing $A_{i(g-2)}$ and $\tau_{1ig}$ on $A_{i(g-1)}$. 

As shown in the main text, one component of $\mathrm{cov}(\tau^*_{1ig}, A^*_{i(g-2)})$ is $\mathrm{cov}(\tau^*_{1ig}, e^*_{i(g-1)})$. If this 
component is non-zero, then the Rothstein test can reject incorrectly (condition 3). 

Condition 3 would not occur if lagged achievement and its error term were jointly normally 
distributed. In this situation, the expected value of the error would be a linear function of lagged 
achievement and the residual error would be uncorrelated with any function of lagged achievement 
(linear or nonlinear). 45 But, as we show below, it is quite likely that lagged achievement is not 
normally distributed because it is itself impacted by lagged teachers. The impacts of these teachers 
depend on the fraction of time a student is assigned to a teacher, which is necessarily bounded 
between zero and one, and therefore is not normally distributed. This means that the expected value 
of the lagged error (one element in the equation for condition 3) is likely to be a nonlinear function 
of lagged achievement. The current teacher (the other element of condition 3) is also a nonlinear 
function of lagged achievement because assignment to the current teacher is also a non-normally 
distributed variable, that is, students either are, or are not, assigned to a given teacher. Since both the 
lagged error and current teacher are nonlinear functions of lagged achievement, they can remain 
correlated even after conditioning on lagged achievement in a linear regression. 

In showing a more formal description of how Rothstein’s test can reject because of condition 3, 
we make the simplifying assumption that the lagged error is jointly normally distributed with double 


45 If the current teacher is a function of lagged achievement and an uncorrelated error, its residual (controlling for 
lagged achievement) would also be uncorrelated with the lagged error. 
lagged achievement. 46 Given this joint normality assumption, we know that for either lagged 
teacher’s students, the expected value of $A_{i(g-2)}$ is a linear function of $A_{i(g-1)}$. Let these functions be: 

$E(A_{i(g-2)} \mid \tau_{1i(g-1)} = 1, A_{i(g-1)}) = \alpha_1 + \lambda_1 A_{i(g-1)}$ for students with the better lagged teacher, and 

$E(A_{i(g-2)} \mid \tau_{1i(g-1)} = 0, A_{i(g-1)}) = \alpha_0 + \lambda_0 A_{i(g-1)}$ for students with the omitted lagged teacher. 

Now consider the function for the probability of having the more effective lagged teacher as a 
function of lagged achievement. This is nonlinear since $\tau_{1i(g-1)}$ is a discrete variable. Thus, 

$\tau_{1i(g-1)} = T_1(A_{i(g-1)})$ 

Note that this is not a tracking function because we are describing a function that relates the 
lagged teacher to the lagged test score, and not to the double-lagged test score. 

Finally, the equation for the expected value of $A_{i(g-2)}$ as a function of $A_{i(g-1)}$ that combines both 
sets of students can be written as follows: 

$E(A_{i(g-2)} \mid A_{i(g-1)}) = (\alpha_1 + \lambda_1 A_{i(g-1)})\,T_1(A_{i(g-1)}) + (\alpha_0 + \lambda_0 A_{i(g-1)})\,T_0(A_{i(g-1)})$ 

where $T_0(A_{i(g-1)}) = 1 - T_1(A_{i(g-1)})$. 

$T_1(\cdot)$ and $T_0(\cdot)$ are both nonlinear functions of $A_{i(g-1)}$. Thus, $E(A_{i(g-2)} \mid A_{i(g-1)})$ is a nonlinear function 
of $A_{i(g-1)}$. Since both $E(A_{i(g-2)} \mid A_{i(g-1)})$ and $\tau_{1ig}$ are nonlinear functions of $A_{i(g-1)}$, this suggests that they 
could be correlated even after controlling for $A_{i(g-1)}$ linearly. This can, in turn, cause the Rothstein 
test to reject incorrectly. 47 
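
The nonlinearity can be checked numerically in a stylized case. The sketch below assumes, purely for illustration, that double-lagged achievement is standard normal, that half of the students had the better lagged teacher (with effect `b`) independently of their double-lagged score, and that lagged achievement equals `lam` times double-lagged achievement plus the lagged teacher effect plus a normal error. The function names and parameter values are hypothetical.

```python
import math

def prob_better_lagged_teacher(a, b, s2):
    """T_1(a): probability of the better lagged teacher given lagged score a."""
    # Lagged scores are N(b, s2) with the better teacher and N(0, s2) otherwise,
    # with equal shares, so the posterior odds are a logistic function of a.
    log_odds = (2 * a * b - b * b) / (2 * s2)
    return 1.0 / (1.0 + math.exp(-log_odds))

def expected_double_lag(a, lam=0.91, sd_e=0.4, b=0.1):
    """E(double-lagged score | lagged score a), mixing over the two lagged teachers."""
    s2 = lam ** 2 + sd_e ** 2     # variance of the lagged score within a teacher group
    k = lam / s2                  # within-group slope, so lambda_1 = lambda_0 = k
    t1 = prob_better_lagged_teacher(a, b, s2)
    # Within-group lines: alpha_1 = -k * b for the better teacher, alpha_0 = 0 otherwise.
    return (-k * b + k * a) * t1 + (k * a) * (1.0 - t1)

def second_difference(f, a, h=0.5):
    """Zero (up to rounding) for a linear function, non-zero where f is curved."""
    return f(a + h) - 2.0 * f(a) + f(a - h)

curvature_with_teachers = second_difference(expected_double_lag, 1.0)
curvature_without_teachers = second_difference(
    lambda a: expected_double_lag(a, b=0.0), 1.0
)
```

With `b = 0` the conditional expectation is exactly linear. With `b = 0.1` it is curved, although in this stylized setup the curvature is small.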


46 We use this example to show how condition 3 might hold because neither of the variables that appear in 
conditions 1 or 2 affect tracking in this example (double-lagged achievement or the lagged teacher). However, it is also 
true that conditions 1 and 2 might hold in this situation. This can happen because of nonlinearities similar to those 
discussed here. 

47 Allowing $A_{i(g-2)}$ to be non-normal could introduce additional nonlinearities that could also cause the Rothstein 
test to reject incorrectly. 
APPENDIX B 

NEGATIVELY CORRELATED ERRORS NEED NOT CAUSE BIAS 

In this appendix, we show that negatively correlated errors need not cause bias if lagged 
achievement scores are normally distributed. To make this point, we first show that negatively 
correlated errors may result in an omitted variable that is a linear function of the lagged achievement 
score. An omitted variable with this property will bias the lagged achievement coefficient estimate, 
but will not cause bias in the teacher effects, a point that is well known. This is important because it 
means that the negative correlation in errors across grades assumed by Rothstein need not cause bias 
on its own. 

To investigate these issues, we consider a model in which the error terms are negatively correlated, as 
Rothstein posits. This could happen if students who had an above-average error term in the 
previous period forget more than students with an average error term in direct proportion to how 
far above average they were in the previous period. Similarly, those who had a below-average error 
term in the previous period learn more than students with an average error term (perhaps from 
other students or their teacher), again, in direct proportion to how far they were below average in 
the previous period. 48 Mathematically, this can be written as: 

$e_{ig} = w_1 u_{ig} + w_2 u_{i(g-1)}$ 

where $\mathrm{cov}(u_{ig}, u_{i(g-1)}) = 0$ and $w_2 < 0$. 

This implies that 

(B.1) 49 $\mathrm{cov}(e_{ig}, e_{i(g-1)}) = w_1 w_2\,\mathrm{var}(u_{i(g-1)}) < 0$ when $w_1 > 0$. 

As in our simulations, we assume that lagged achievement depends only on lagged teachers, an 
error term, and a random starting point ($A_{i(g-2)}$) that is normally distributed. 50 Thus: 

$A_{ig} = A_{i(g-1)}\lambda + \tau_{1ig}\beta_{1g} + e_{ig}$ 

$A_{i(g-1)} = A_{i(g-2)}\lambda + \tau_{1i(g-1)}\beta_{1(g-1)} + e_{i(g-1)}$ 


48 This is sometimes described as “regression to the mean,” although that phrase is sometimes used to describe 
situations in which the true errors are uncorrelated across grades. 

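49 In the following five lines, we derive equation (B.1). Here $e_{iug}$ denotes the component of the error that is uncorrelated across grades and $\rho$ is the weight on its lag. 

$\mathrm{cov}(e_{ig}, e_{i(g-1)})$ 

$= \mathrm{cov}(\rho\,e_{iu(g-1)} + e_{iug},\ \rho\,e_{iu(g-2)} + e_{iu(g-1)})$ 

$= \rho^2\,\mathrm{cov}(e_{iu(g-1)}, e_{iu(g-2)}) + \rho\,\mathrm{cov}(e_{iu(g-1)}, e_{iu(g-1)}) + \rho\,\mathrm{cov}(e_{iug}, e_{iu(g-2)}) + \mathrm{cov}(e_{iug}, e_{iu(g-1)})$ 

$= \rho^2 \cdot 0 + \rho\,\mathrm{var}(e_{iu(g-1)}) + \rho \cdot 0 + 0$ 

$= \rho\,\mathrm{var}(e_{iu(g-1)}) < 0.$ 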

50 This can be justified if we think of two grade levels in the past as the grade when the child entered school and 
that there was no tracking in that grade. All subsequent learning is captured by later teacher effects and error terms. 
To generate normally distributed baseline scores (and unbiased teacher effect 
estimates), we assume that lagged teachers have no impact on lagged achievement levels. Thus, 

$A_{i(g-1)} = A_{i(g-2)}\lambda + e_{i(g-1)}$ 

where $A_{i(g-2)}$, $e_{i(g-1)}$, and $e_{ig}$ are jointly normal and uncorrelated with each other by assumption. This 
implies that $A_{i(g-1)}$ and $e_{ig}$ are also jointly normal and linearly related because a linear function of two 
jointly normally distributed variables is also jointly normal and linearly associated with each of those 
variables (Goldberger 1991; Theil 1957). Indeed, all four variables are jointly normal and linearly 
related. 

$(A_{i(g-2)},\ A_{i(g-1)},\ e_{i(g-1)},\ e_{ig})' \sim N(\mu, \Sigma)$ 

In particular, the expected value of $e_{ig}$ is a linear function of $A_{i(g-1)}$. Given this, let $er_{ig}$ be the 
residual from a regression of $e_{ig}$ on $A_{i(g-1)}$: 

$e_{ig} = \gamma A_{i(g-1)} + er_{ig}$ 

where $\gamma$ is the coefficient on lagged achievement. 51 

Now we need to show that $er_{ig}$ is uncorrelated with $\tau_{1ig}$, the current teacher. We start by assuming 
a specific functional form for $\tau_{1ig}$: 

(B.2) $\tau_{1ig} = 1$ if $A_{i(g-1)} > 0$ and 0 otherwise. 

We then show that $er_{ig}$ is uncorrelated with any function of $A_{i(g-1)}$ and therefore is uncorrelated 
with $\tau_{1ig}$. 

By assumption, $e_{ig}$ and $A_{i(g-1)}$ are jointly normal. By construction, $er_{ig}$ is a linear function of these 
variables equal to $e_{ig} - \gamma A_{i(g-1)}$. Thus $er_{ig}$ and $A_{i(g-1)}$ are also jointly normal. By construction, $er_{ig}$ is also 
uncorrelated with $A_{i(g-1)}$. Joint normality and zero correlation imply independence. This, in turn, 
means that $er_{ig}$ is uncorrelated with any function of $A_{i(g-1)}$ regardless of whether it is linear or 
nonlinear. Since $\tau_{1ig}$ is a function of $A_{i(g-1)}$, it is also uncorrelated with $er_{ig}$. 52 
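
The argument can be illustrated with a small simulation. This is a sketch under assumptions chosen for illustration: double-lagged achievement is standard normal, there are no lagged teacher effects, errors are built from uncorrelated unit-variance draws with weights that give a standard deviation of about 0.4 and an adjacent-grade correlation of about -0.25, and assignment follows equation (B.2). The helper names are hypothetical.

```python
import math
import random

def mean(x):
    return sum(x) / len(x)

def ols_slope(x, y):
    """Slope from a one-regressor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)

def sample_corr(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
n = 100_000
lam, w1, w2 = 0.95, 0.3864, -0.1035   # illustrative values, not taken from the paper

lagged_ach, current_err = [], []
for _ in range(n):
    u_prev2, u_prev, u_cur = (rng.gauss(0.0, 1.0) for _ in range(3))
    double_lag = rng.gauss(0.0, 1.0)
    e_prev = w1 * u_prev + w2 * u_prev2
    e_cur = w1 * u_cur + w2 * u_prev              # shares u_prev with e_prev
    lagged_ach.append(lam * double_lag + e_prev)  # no lagged teacher effect
    current_err.append(e_cur)

gamma = ols_slope(lagged_ach, current_err)
intercept = mean(current_err) - gamma * mean(lagged_ach)
residual = [e - intercept - gamma * a for a, e in zip(lagged_ach, current_err)]
teacher = [1.0 if a > 0 else 0.0 for a in lagged_ach]   # equation (B.2)

raw_corr = sample_corr(current_err, teacher)    # error vs. teacher before partialling out
resid_corr = sample_corr(residual, teacher)     # residual vs. teacher
```

In this setup the current error is negatively correlated with the teacher indicator, but the residual left after a linear regression on lagged achievement is not, up to sampling noise.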

Using the symbols from equation 2 in the main body of this paper (the omitted variable bias 
formula), this means that $\mathrm{cov}(e^*_{ig}, \tau^*_{1ig}) = 0$. To see this, note that $er_{ig}$ is the residual that remains after 


51 We expect $\gamma$ to be less than 0 since $\mathrm{cov}(e_{ig}, A_{i(g-1)})$ is less than 0. 

52 In a more realistic scenario, $\tau_{1ig}$ would also depend on some additional variables. As long as they are also 
distributed independently of $er_{ig}$, then $er_{ig}$ will remain conditionally uncorrelated with $\tau_{1ig}$. 
regressing $e_{ig}$ on $A_{i(g-1)}$. 53 Thus, $er_{ig}$ is the same as $e^*_{ig}$. Similarly, $\tau^*_{1ig}$ is the residual that remains after 
regressing $\tau_{1ig}$ on $A_{i(g-1)}$. If $e^*_{ig}$ is uncorrelated with $\tau_{1ig}$, then it will also be uncorrelated with $\tau^*_{1ig}$ 
because $\tau^*_{1ig}$ is a linear function of $\tau_{1ig}$ and $A_{i(g-1)}$, and $e^*_{ig}$ is uncorrelated with both of those variables. 

The negative correlation in errors means that students who scored lowest on the previous test 
will score somewhat higher than otherwise expected in the current period and vice versa. The 
coefficient estimate on $A_{i(g-1)}$ will be biased downwards, but, in this case, the negative correlation has no 
impact on the coefficient on $\tau_{1ig}$. 
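This claim can be illustrated with a small regression exercise. The sketch below is illustrative only: the persistence parameter `lam = 0.8`, the teacher effect `beta = 0.25`, the error correlation `rho = -0.3`, the unit variances, and the function name `fit_vam` are assumptions made here, not values from this paper. It generates current achievement from lagged achievement, a tracked teacher dummy, and an error that is negatively correlated with the lagged error, then fits the value-added regression by ordinary least squares.

```python
import numpy as np


def fit_vam(n=200_000, lam=0.8, beta=0.25, rho=-0.3, seed=1):
    """Return OLS coefficients on (lagged achievement, teacher dummy)."""
    rng = np.random.default_rng(seed)

    # Assumed data generation with unit variances and negatively correlated errors.
    a_g2 = rng.standard_normal(n)                              # A_i(g-2)
    cov = [[1.0, rho], [rho, 1.0]]
    e_lag, e_cur = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    a_lag = a_g2 + e_lag                                       # A_i(g-1)

    # Students are tracked to teacher 1 on lagged achievement, as in (B.2).
    tau = (a_lag > 0).astype(float)

    # Current achievement: persistence, true teacher effect, and error.
    a_cur = lam * a_lag + beta * tau + e_cur

    # Value-added regression of A_ig on a constant, A_i(g-1), and tau.
    design = np.column_stack([np.ones(n), a_lag, tau])
    coef, *_ = np.linalg.lstsq(design, a_cur, rcond=None)
    return coef[1], coef[2]


lag_coef, teacher_coef = fit_vam()
print(f"coefficient on lagged achievement: {lag_coef:.3f} (true persistence 0.8)")
print(f"coefficient on teacher dummy:      {teacher_coef:.3f} (true effect 0.25)")
```

With these assumed parameters, the lagged-achievement coefficient should settle near `lam + gamma` (about 0.65 here) rather than 0.8, while the teacher coefficient should stay close to its true value.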

Some readers might also be concerned about the plausibility of the data generation process we 
propose for tracking because it appears to depend on a latent variable that is a linear function of 
lagged achievement. However, as noted above, tracking is necessarily a nonlinear function of this 
latent variable. The functional form we propose allows for this. More precisely, one could think of 
tracking as a two-stage system in which tracking depends on a latent variable that, in turn, depends 
on lagged achievement. This can be written either by stage or in a single stage by substituting out for 
the latent variable (LV). Thus, 

Stage 1: $LV = \beta_{T^*} A_{i(g-1)}$ 

Stage 2: $\tau_{1ig} = T(LV)$ 

Combined: $\tau_{1ig} = T(\beta_{T^*} A_{i(g-1)})$ 

where $\beta_{T^*}$ is the coefficient on lagged achievement in Stage 1. 

We have assumed that the first stage is linear. However, even if the first stage were nonlinear 
this would not affect our argument because the second stage is nonlinear. Thus, nonlinearity in the 
first stage is not an issue. What is key to our argument for this appendix — that a “plausible” data 
generation process can yield unbiased results — is that the relationship between the current error 
term and lagged achievement is linear. 
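The point about the first stage can also be sketched numerically. In the code below, the function name `tracking_residual_corr`, the two example first-stage functions, the cutoff, and all variances and correlations are hypothetical choices made for illustration. The error term is kept linearly related to lagged achievement, while the latent variable is allowed to be either a linear or a nonlinear function of lagged achievement.

```python
import numpy as np


def tracking_residual_corr(first_stage, cutoff=0.0, n=200_000, rho=-0.3, seed=2):
    """Correlation between er_ig and a teacher dummy built from a latent variable."""
    rng = np.random.default_rng(seed)

    # Assumed jointly normal data; e_ig is linearly related to A_i(g-1).
    a_g2 = rng.standard_normal(n)
    cov = [[1.0, rho], [rho, 1.0]]
    e_lag, e_cur = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    a_lag = a_g2 + e_lag

    # Residual of e_ig after removing its linear dependence on A_i(g-1).
    gamma = np.cov(e_cur, a_lag)[0, 1] / np.var(a_lag, ddof=1)
    er = e_cur - gamma * a_lag

    # Stage 1: latent variable from lagged achievement (linear or nonlinear).
    latent = first_stage(a_lag)
    # Stage 2: nonlinear assignment rule applied to the latent variable.
    tau = (latent > cutoff).astype(float)

    return np.corrcoef(er, tau)[0, 1]


linear_corr = tracking_residual_corr(lambda a: 0.5 * a)
nonlinear_corr = tracking_residual_corr(lambda a: a**2 - 1.0)
print(f"linear first stage:    corr(er, tau) = {linear_corr:.4f}")
print(f"nonlinear first stage: corr(er, tau) = {nonlinear_corr:.4f}")
```

In both cases the teacher dummy remains a function of lagged achievement alone, so the residual's correlation with it should be zero up to sampling noise.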


53 In the main body of this paper, we discussed creating residuals by regressing each variable on lagged 
achievement and the other teacher dummies. In this model, there are no other teachers because there are only two 
teachers and one is omitted. 